Adding new ARC job runner #16653
base: dev
Conversation
Adding pyarcrest conditional requirement for the ArcRESTJobRunner
Require version 0.2 which is compatible with the ARC job runner.
I have a couple of refactorings I consider relatively important - can you pull them out of https://github.com/jmchilton/galaxy/tree/arc_job_runner? This brings the runner in line with other job runners, makes it more isolated for potential unit testing, simplifies some big overloaded methods, and would allow a lot of this logic to be reused with a potential Pulsar job runner. I think Pulsar is the way you want to go long term, but I am happy to allow a Galaxy job runner for this as long as you don't get into command-rewriting logic - which there isn't here; you're staging paths as is. That staging could be more targeted with Pulsar, but I'm fine with being wary of premature optimization.

I haven't tested my changes - I think the refactorings are vaguely correct, but I probably introduced some bugs. Hopefully any regressions will be easy to track down, but if I broke something and you want advice, I'm happy to work with you on that.

My next refactoring would be this almost global ARCJob thing. You're reusing the same object for every job and every request? Shouldn't this be somehow constructed per job? Is there an issue with that? I'd be very anxious about threading and cleanup as things are now. Also the job object has a

I think addressing these three points would make it easier for me to grok what is going on with the job runner and provide additional feedback.
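For illustration, a minimal sketch of the "construct per job" pattern raised above. `ArcJobState` is a hypothetical stand-in, not the actual pyarcrest `ARCJob` class or the runner's real code; the point is only that each Galaxy job gets its own object, so worker threads never share mutable state.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical per-job state container; not the real pyarcrest ARCJob class.
@dataclass
class ArcJobState:
    galaxy_job_id: str
    arc_job_id: str = ""  # filled in after submission to the ARC endpoint
    input_files: List[str] = field(default_factory=list)

def new_job_state(galaxy_job_id: str) -> ArcJobState:
    # A fresh object per Galaxy job; nothing here is shared between threads.
    return ArcJobState(galaxy_job_id=galaxy_job_id)
```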
lib/galaxy/jobs/runners/arc.py
Outdated
def job_actions(self, arcid, action):
Also - can you just make this two separate methods with two separate returns? This sort of method dispatching weakens our toolchain's ability to do static checking and I think makes the code a little bit harder to read.
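For what it's worth, a rough sketch of that split might look like the following. Only `killJobs` appears in the diff below; the status call name here is a placeholder, since the actual pyarcrest method may differ.

```python
def job_status(self, arcid):
    """Return the ARC status for a single job id, or None if the lookup fails."""
    statuses = self.arcrest.getJobsStatus([arcid])  # placeholder pyarcrest call; name may differ
    if not statuses:
        log.error(f"Could not get status of job with id: {arcid}")
        return None
    return statuses[0]

def kill_job(self, arcid):
    """Ask the ARC endpoint to kill a single job; return True if it was killed."""
    results = self.arcrest.killJobs([arcid])
    return bool(results and results[0])
```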
Thanks, will look at it!
lib/galaxy/jobs/runners/arc.py
Outdated
    log.error(f'Could not get status of job with id: {arcid}')
elif action == "kill":
    results = self.arcrest.killJobs([arcid])
    for killed in results:
This will return after the first iteration, so really there's no iteration - is that intentional? Also, you could just return bool(killed) instead of the if/then.
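Something like this sketch, assuming killJobs returns one result per submitted id:

```python
# Only one id is submitted, so there is at most one result; no loop needed.
results = self.arcrest.killJobs([arcid])
return bool(results[0]) if results else False
```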
lib/galaxy/jobs/runners/arc.py
Outdated
self._init_monitor_thread()
self._init_worker_threads()
These methods are called automatically on Galaxy startup; the logic goes like this: app.py > JobManager().start() > JobHandler().start() > DefaultJobDispatcher().start() > for each job runner, runner.start(), which is where both methods are called. Unless this is a special case (which I missed)?
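For reference, a hedged sketch of what that implies for the runner itself, assuming the standard AsynchronousJobRunner base class is used (the exact constructor signature should be checked against the current Galaxy code):

```python
from galaxy.jobs.runners import AsynchronousJobRunner

class ArcRESTJobRunner(AsynchronousJobRunner):
    runner_name = "ArcRESTJobRunner"

    def __init__(self, app, nworkers, **kwargs):
        super().__init__(app, nworkers, **kwargs)
        # No explicit self._init_monitor_thread() / self._init_worker_threads() here:
        # per the discussion above, Galaxy calls them through the
        # JobManager -> JobHandler -> dispatcher -> runner.start() chain.
```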
I did not review/check whether this is necessary; I just followed https://docs.galaxyproject.org/en/master/dev/build_a_job_runner.html. Maybe it is outdated?
Yeah, that part of the documentation can be interpreted both ways (i.e., this is what the runner does vs. this is the code that should be added). I'd leave it out - it should work: all runners start those threads, which is why it's handled in the base runner.
To get rid of the linting errors, you can just
Thanks for looking closely at this and spotting issues and suggesting improvements. I will implement the changes.
Thanks a lot for these suggestions, they improve the code a lot. I have applied the changes and will test now. About the ARCJob - I might not need this anymore, in fact. It was needed in an earlier version of pyarcrest, but in the meantime we have done some refactoring and I think it is of no use anymore, so I will look into that now. And yes, your concern about the ARCJob-per-job issue is very correct: I noticed what I had done the other day and noted that I had to have a look at what I had actually implemented there, as it seemed wrong.
…cessary job_actions method. Further improvements to the action on jobs will come.
I am rewriting to be able to use the job_runner_external_id. However, in check_watched_item I seem to have difficulties getting the id.

And by the way: since a getter is used to get the id, it would be nice to use a setter to set it. But I see the setter is an attribute of the Task object, and not the Job object. Any opinions on this?
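One possible pattern, sketched under the assumption that the usual JobWrapper/Job helpers behave here as they do in other runners (set_job_destination accepting an external id, and job_runner_external_id on the Job model) - the names should be double-checked against the current Galaxy code:

```python
def _record_external_id(job_wrapper, arc_job_id: str) -> None:
    # Persist the ARC job id on the Galaxy Job record at submission time so it
    # can be read back later; relies on set_job_destination accepting an
    # external id, as other runners do (assumption).
    job_wrapper.set_job_destination(job_wrapper.job_destination, arc_job_id)

def _read_external_id(job_state) -> str:
    # job_state would be the AsynchronousJobState kept on the monitor queue.
    return job_state.job_wrapper.get_job().job_runner_external_id
```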
Sorted this out.
About testing - so finally (sorry for the delay and multiple tries) I fixed the whitespace linting errors and other stuff. Now I'm trying to sort out the error I see here; running locally all is ok, but in the repo tests I see a failure. What should I do to fix this error?
I've restarted those tests - they've just passed, so the previous error was unrelated.
The failed jobs - I'm not sure how I should fix this. Is it related or unrelated to my changes?
This feature is a prototype created in the context of ESG.
A new job runner - for submitting jobs to a remote ARC middleware endpoint. Behind the ARC endpoint is a regular batch system (Slurm, HTCondor, etc.). ARC endpoints are typically in front of designated grid infrastructures or HPC systems shared with other users.
The use case is a user who has access and compute time on one or several HPC systems served by ARC. Instead of using the traditional command-line approach to send the jobs, Galaxy can be used. Authentication is handled by ARC, and the uploading of remote input datasets is also handled by ARC. The user authenticates with ARC in the regular way, via tokens issued by one of the endpoint's supported OIDC token providers. For this to be convenient, the user should log into Galaxy using this supported OIDC token provider. Currently a WLCG IAM backend, which is one of the supported IdPs, is in a PR to social_core (python-social-auth/social-core#820). The Galaxy server must be configured to use this OIDC provider as a login option by configuring oidc_backends_config.yml. See details here: #16617
Currently, the user must configure their user preferences with a single ARC endpoint to use. There is a feature request asking for the possibility for the user to add as many endpoints as wanted: #16596
The form allowing users to add an ARC endpoint is currently merged into the usegalaxy-eu infrastructure-playbook: usegalaxy-eu/infrastructure-playbook#879. If you want to try this out, just have a look at this and apply the small additions to the relevant files.
Currently also, the runner only works with a specific tool designated for ARC jobs. It has not yet been made ready for use with toolbox tools. The arc-tool config file looks like this:
The tool expects that you upload an executable. In this case a simple bash script with the following contents:
But the executable could very well be some scientific code in C++ or whatever, as long as the ARC endpoint supports that software (this will be handled properly later with so-called ARC runtime environment settings and matchmaking). Currently this tool allows uploading a single executable and no other input data, just for testing purposes. In the future the job runner will support uploading an arbitrary number of input files from Galaxy, in addition to remote input files (ARC collects these from the remote source).
Once the job is done, all the output files created by the job's executable and by ARC itself are uploaded into Galaxy as expected.
How to test the changes?
Instructions for manual testing are as follows:
For the galaxy admin:
For the user:
License